Directly Estimating the Variance of the λ-Return Using Temporal-Difference Methods

Authors

  • Craig Sherstan
  • Brendan Bennett
  • Kenny Young
  • Dylan R. Ashley
  • Adam White
  • Martha White
  • Richard S. Sutton
Abstract

This paper investigates estimating the variance of a temporal-difference learning agent's update target. Most reinforcement learning methods use an estimate of the value function, which captures how good it is for the agent to be in a particular state and is mathematically expressed as the expected sum of discounted future rewards (called the return). These values can be straightforwardly estimated by averaging batches of returns using Monte Carlo methods. However, if we wish to update the agent's value estimates during learning, before terminal outcomes are observed, we must use a different estimation target called the λ-return, which truncates the return with the agent's own estimate of the value function. Temporal-difference learning methods estimate the expected λ-return for each state, allowing these methods to update online and incrementally, and in most cases achieve better generalization error and faster learning than Monte Carlo methods. Naturally, one could attempt to estimate higher-order moments of the λ-return. This paper is about estimating the variance of the λ-return. Prior work has shown that, given estimates of the variance of the λ-return, learning systems can be constructed to (1) mitigate risk in action selection, and (2) automatically adapt the parameters of the learning process itself to improve performance. Unfortunately, existing methods for estimating the variance of the λ-return are complex and not well understood empirically. We contribute a method for estimating the variance of the λ-return directly using policy evaluation methods from reinforcement learning. Our approach is significantly simpler than prior methods that independently estimate the second moment of the λ-return. Empirically, our new approach behaves at least as well as existing approaches, but is generally more robust.
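To make the λ-return concrete, the following is a minimal sketch, not the paper's method. It assumes the standard recursive form G^λ_t = R_{t+1} + γ[(1 − λ)V(S_{t+1}) + λG^λ_{t+1}], with the return after termination taken as zero. The function names, the episode format, and the numbers are hypothetical and chosen only for illustration. The second function is a plain empirical (Monte Carlo style) variance over episodes, the kind of batch baseline that an incremental temporal-difference estimator would aim to replace.

```python
from statistics import pvariance


def lambda_returns(rewards, next_values, gamma, lam):
    """Compute the lambda-return for every step of one finished episode.

    rewards[t] is the reward received after step t, and next_values[t] is the
    agent's current value estimate for the state reached after step t
    (use 0.0 for the terminal state). Both names are illustrative assumptions.
    """
    returns = [0.0] * len(rewards)
    g_next = 0.0  # the lambda-return after termination is zero
    for t in reversed(range(len(rewards))):
        # Mix the bootstrapped estimate (weight 1 - lam) with the
        # continuing lambda-return (weight lam), then discount.
        returns[t] = rewards[t] + gamma * (
            (1.0 - lam) * next_values[t] + lam * g_next
        )
        g_next = returns[t]
    return returns


def start_return_variance(episodes, gamma, lam):
    """Empirical variance of the lambda-return from the first step.

    Assumes every episode in `episodes` starts in the same state; each
    episode is a (rewards, next_values) pair.
    """
    starts = [lambda_returns(r, v, gamma, lam)[0] for r, v in episodes]
    return pvariance(starts)


if __name__ == "__main__":
    # Hypothetical data: two short episodes from the same start state.
    episodes = [
        ([1.0, 0.0, 2.0], [0.5, 0.25, 0.0]),
        ([0.0, 1.0], [0.25, 0.0]),
    ]
    print(start_return_variance(episodes, gamma=0.9, lam=0.5))
```

With λ = 1 the recursion reduces to the ordinary discounted Monte Carlo return, and with λ = 0 it reduces to the one-step target R_{t+1} + γV(S_{t+1}), which is the sense in which the λ-return truncates the return with the agent's own value estimate.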


Similar resources

Modeling Stock Return Volatility Using Symmetric and Asymmetric Nonlinear State Space Models: Case of Tehran Stock Market

Volatility is a measure of uncertainty that plays a central role in financial theory, risk management, and pricing authority. Turbulence is the conditional variance of changes in asset prices that is not directly observable and is considered a hidden variable that is indirectly calculated using some approximations. To do this, two general approaches are presented in the literature of financial ...


Investigation of Temporal Phenomena of Sediment Rating Curve and comparison of it with the Some Statistical Methods for Estimating Suspended Sediment Load (Case study: Gamasiab Watershed)

The variable and complex nature of river sediment loads makes it difficult to estimate the sediment entering reservoirs and the long-term sediment yield used to determine the lifetime of structures. Application of sediment rating curves is one of the most common methods for estimating the suspended sediment load of rivers. Regardless of the accuracy ...


Optimization of runoff Coefficient and Concentration Time in Estimating Flood Discharge Values by SCS Method (Case Study: Catchment Basin of Kohanrood River)

Estimation of floods in a basin with various return periods is one of the effective management strategies for reducing flood damage. One of the methods for estimating flood discharge is to construct a synthetic unit hydrograph using the physical characteristics of the basin. The more accurate the model inputs, the more valid the results. Hence, in basins in which instantaneous peak discharge is rec...


Estimating IDF based on daily precipitation using temporal scale model

Intensity-duration-frequency (IDF) curves play a most important role in watershed management, flood control, and the hydraulic design of structures. The conventional method for calculating IDF curves needs hourly rainfall data for different durations, which is not extensively available in many regions. Instead, 24-hour precipitation statistics are measured at most rain-gauge stations. In this stud...


Modelling and Investigating the Differences and Similarities in the Volatility of the Stocks Return in Tehran Stock Exchange Using the Hybrid Model PANEL-GARCH

Efficient financial markets with high degree of transparency do not substantiate the hypothesis that there are differences in the volatility of return. Generally, there are factors rejecting any perfect similarity in the volatility of return in the emerging stock markets, as previous studies in Iran have confirmed the complete difference. On the other hand, the hybrid model PANEL-GARCH has the ...




Publication date: 2018